Feature Engineering¶
The main objectives for this notebook are:
- Develop a set of features that have a potential to improve our model's performance
- Investigate the relationships between our new features and the target
Important Steps¶
- Engineer well-argued features (double bonus points if backed by sources)
- Validate features after engineering
- Don't use blind (automated) feature engineering; it is usually wasted effort because:
  - Irrelevant features can reduce model performance
  - It makes models harder to interpret and explain
  - Generated features often lack alignment with business goals
- Design a feature engineering pipeline at the end of the notebook
Imports¶
import os
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.io as pio
import seaborn as sns
from feature_engine.selection import SmartCorrelatedSelection
import polars as pl
# Path needs to be added manually to read from another folder
path2add = os.path.normpath(
os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if not (path2add in sys.path):
sys.path.append(path2add)
from feature_engineering import (
aggregate_node_features,
feature_predictive_power,
get_graph_features,
)
pio.renderers.default = "notebook"
data = pl.read_parquet('../data/supervised_clean_data.parquet')
calls = pl.read_json('../data/supervised_call_graphs.json')
data.head(1)
| _id | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool |
| 0 | "1f2c32d8-2d6e-3b68-bc46-789469… | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | false |
calls.head(1)
| _id | call_graph |
|---|---|
| str | list[struct[2]] |
| "1f2c32d8-2d6e-3b68-bc46-789469… | [{"1f873432-6944-3df9-8300-8a3cf9f95b35","5862055b-35a6-316a-8e20-3ae20c1763c2"}, {"8955faa9-0e33-37ad-a1dc-f0e640a114c2","a4fd6415-1fd4-303e-aa33-bb1830b5d9d4"}, … {"016099ea-6f20-3fec-94cf-f7afa239f398","6fa8ad53-2f0d-3f44-8863-139092bfeda9"}] |
Since the main dataset already contains engineered features, there is little room for further feature engineering there. Instead, additional features will be created from the graph data in supervised_call_graphs.json
Process Graph Data¶
calls_processed = (
calls.with_columns(
pl.col("call_graph").list.eval(
pl.element().struct.rename_fields(["from", "to"])
)
)
.explode("call_graph")
.unnest("call_graph")
)
calls_processed.head()
| _id | from | to |
|---|---|---|
| str | str | str |
| "1f2c32d8-2d6e-3b68-bc46-789469… | "1f873432-6944-3df9-8300-8a3cf9… | "5862055b-35a6-316a-8e20-3ae20c… |
| "1f2c32d8-2d6e-3b68-bc46-789469… | "8955faa9-0e33-37ad-a1dc-f0e640… | "a4fd6415-1fd4-303e-aa33-bb1830… |
| "1f2c32d8-2d6e-3b68-bc46-789469… | "85754db8-6a55-30b7-8558-dec75f… | "85754db8-6a55-30b7-8558-dec75f… |
| "1f2c32d8-2d6e-3b68-bc46-789469… | "9f08fee1-953c-3801-b254-c0256f… | "876b4958-7df1-3b2b-9def-1a22f1… |
| "1f2c32d8-2d6e-3b68-bc46-789469… | "857c4b20-3057-30e0-9ca3-d6f5c3… | "857c4b20-3057-30e0-9ca3-d6f5c3… |
Feature Engineering¶
We can see that each graph has a separate _id that can later be used to join back to the main dataset. A graph consists of source and destination nodes, which refer to the available API calls.
Basic Graph Level Features¶
The most basic graph-level features that we can engineer are:
- Number of edges (connections)
- Number of nodes (APIs)
These features can be useful since most behaviours will have a "normal" range of APIs that they contact. If this number is too large or too small, it might indicate anomalous activity.
graph_features = calls_processed.group_by('_id').agg(
pl.len().alias('n_connections'),
pl.col('from'),
pl.col('to')
).with_columns(
pl.concat_list('to', 'from').list.unique().list.len().alias('n_unique_nodes')
).select([
'_id',
'n_connections',
'n_unique_nodes'
])
graph_features.sample(3)
| _id | n_connections | n_unique_nodes |
|---|---|---|
| str | u32 | u32 |
| "fd384a3f-5e0a-39b1-9231-c6c03e… | 2 | 2 |
| "06e61dbd-14b3-3ec7-88ce-1bc012… | 69 | 44 |
| "0331e900-3071-3cb2-840b-148317… | 36 | 10 |
Node Level Features¶
Since graphs consist of nodes, we can engineer a set of features around specific nodes (APIs). We can calculate:
- Node degrees - the number of edges that come into/out of a node. Very highly connected nodes can look anomalous.
- Node centrality - there are various centrality measures (e.g. PageRank), but they all try to estimate how important a specific node is to the whole graph. This feature could be useful because a behaviour pattern that doesn't touch any of the "central" APIs would look anomalous.
These features can be broken down into:
- global features - measure node attributes across all the graphs
- local features - measure node attributes across a specific graph
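Node centrality is mentioned above but not computed in this notebook; as an illustrative sketch, PageRank can be obtained with a few lines of power iteration over an edge list (the `pagerank` helper below is hypothetical, not part of the notebook's utils):

```python
import numpy as np

def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a list of (from, to) edges.

    Returns a dict mapping node -> score. Dangling nodes (no outgoing
    edges) distribute their mass uniformly over all nodes.
    """
    nodes = sorted({n for edge in edges for n in edge})
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)

    # Build a column-stochastic transition matrix: M[j, i] is the
    # probability of moving from node i to node j.
    M = np.zeros((n, n))
    for src, dst in edges:
        M[idx[dst], idx[src]] += 1.0
    col_sums = M.sum(axis=0)
    safe = np.where(col_sums == 0, 1.0, col_sums)
    M = np.where(col_sums > 0, M / safe, 1.0 / n)

    # Iterate r = (1 - d)/n + d * M r until (approximately) converged
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * M @ r
    return dict(zip(nodes, r))
```

Per-graph PageRank scores could then be aggregated (avg/max/std) the same way as the degree features below.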
calls_processed = calls_processed.with_columns(
global_source_degrees = pl.len().over(pl.col('from')),
global_dest_degrees = pl.len().over(pl.col('to')),
local_source_degrees = pl.len().over(pl.col('from'), pl.col('_id')),
local_dest_degrees = pl.len().over(pl.col('to'), pl.col('_id'))
)
calls_processed.sample(3)
| _id | from | to | global_source_degrees | global_dest_degrees | local_source_degrees | local_dest_degrees |
|---|---|---|---|---|---|---|
| str | str | str | u32 | u32 | u32 | u32 |
| "8af1e42f-4428-33ab-8267-a909eb… | "24338956-3f43-3e08-9445-676218… | "120f167d-0a23-31ba-a78f-cc72d7… | 5203 | 6900 | 4 | 6 |
| "1e1ef442-ec30-3853-8da4-80e934… | "9f08fee1-953c-3801-b254-c0256f… | "cb101022-d016-32d1-8987-4e4f4c… | 12171 | 1009 | 46 | 4 |
| "59b79b42-951c-33b9-b9d9-c5f864… | "e0275d86-2e3f-3ba5-bf74-836b5b… | "d100d39e-97ee-3f4c-8006-ee90a8… | 805 | 185 | 4 | 1 |
Now that the node-level features are calculated, we need to aggregate them for a specific graph (_id). When aggregating, we can calculate average, std, min, and max statistics for every feature to capture their distributions well.
node_features_agg = aggregate_node_features(
calls_processed,
node_features=[
"global_source_degrees",
"global_dest_degrees",
"local_source_degrees",
"local_dest_degrees",
],
by="_id",
)
graph_features = graph_features.join(node_features_agg, on="_id")
graph_features.head()
| _id | n_connections | n_unique_nodes | avg_global_source_degrees | min_global_source_degrees | max_global_source_degrees | std_global_source_degrees | avg_global_dest_degrees | min_global_dest_degrees | max_global_dest_degrees | std_global_dest_degrees | avg_local_source_degrees | min_local_source_degrees | max_local_source_degrees | std_local_source_degrees | avg_local_dest_degrees | min_local_dest_degrees | max_local_dest_degrees | std_local_dest_degrees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | f64 | u32 | u32 | f64 | f64 | u32 | u32 | f64 | f64 | u32 | u32 | f64 | f64 | u32 | u32 | f64 |
| "93856471-799b-3d52-8ab7-7bda2b… | 4 | 4 | 1964.0 | 6 | 5203 | 2248.374672 | 922.5 | 6 | 2074 | 873.419525 | 1.0 | 1 | 1 | 0.0 | 1.0 | 1 | 1 | 0.0 |
| "52edc47c-a42b-35f3-a589-651252… | 87 | 40 | 83.862069 | 1 | 596 | 169.22334 | 125.034483 | 1 | 1151 | 315.571775 | 3.436782 | 1 | 7 | 1.633239 | 3.528736 | 1 | 7 | 1.866716 |
| "c21602d0-a7be-332b-88c6-bf54bf… | 704 | 193 | 5191.326705 | 7 | 32071 | 8284.747802 | 5821.362216 | 7 | 22416 | 7103.373958 | 11.457386 | 1 | 55 | 14.23071 | 13.110795 | 1 | 42 | 12.584472 |
| "2a0e5f64-7376-3236-a55f-a72d0d… | 12 | 10 | 8562.916667 | 403 | 32071 | 11569.040005 | 6866.916667 | 813 | 22416 | 9075.869309 | 1.5 | 1 | 2 | 0.522233 | 1.333333 | 1 | 2 | 0.492366 |
| "c2b37668-545f-3f3e-8bf8-f7732e… | 3 | 3 | 1294.0 | 52 | 2406 | 1182.372192 | 763.666667 | 55 | 1213 | 621.032474 | 1.0 | 1 | 1 | 0.0 | 1.0 | 1 | 1 | 0.0 |
Feature Selection¶
Feature selection will be done using 2 steps:
- Quality checks - if the feature is constant or has too many missing values (>= 95%) it will be dropped
- Correlation analysis - if features have very high correlation (>= 95%) with each other, they can be dropped as well
engineered_features = graph_features.columns[1:]
engineered_features
['n_connections', 'n_unique_nodes', 'avg_global_source_degrees', 'min_global_source_degrees', 'max_global_source_degrees', 'std_global_source_degrees', 'avg_global_dest_degrees', 'min_global_dest_degrees', 'max_global_dest_degrees', 'std_global_dest_degrees', 'avg_local_source_degrees', 'min_local_source_degrees', 'max_local_source_degrees', 'std_local_source_degrees', 'avg_local_dest_degrees', 'min_local_dest_degrees', 'max_local_dest_degrees', 'std_local_dest_degrees']
Quality Checks¶
null_counts = graph_features.null_count().transpose(include_header=True, header_name='col', column_names=['null_count'])
null_counts.filter(pl.col('null_count') > 0)
| col | null_count |
|---|---|
| str | u32 |
| "std_global_source_degrees" | 42 |
| "std_global_dest_degrees" | 42 |
| "std_local_source_degrees" | 42 |
| "std_local_dest_degrees" | 42 |
static_features = graph_features.select(engineered_features).std().transpose(include_header=True, header_name='col', column_names=['std'])
static_features.filter(pl.col('std') == 0)
| col | std |
|---|---|
| str | f64 |
Observations:
- 4 columns have missing values, all of them standard deviation aggregates (std is undefined when a group contains a single edge)
Impact
- No features will be dropped for quality reasons
Correlation Analysis¶
Next, let's examine the pairwise correlations between the engineered features.
feature_corrs = graph_features.select(engineered_features).to_pandas().dropna().corr()
feature_corrs.index = feature_corrs.columns
matrix = np.triu(feature_corrs)
fig = plt.figure(figsize=(20, 10))
sns.heatmap(feature_corrs, annot=True, mask=matrix)
<Axes: >
We can see clear groups of highly correlated features. Hence, let's apply SmartCorrelatedSelection to reduce the set of engineered features
features_pd = graph_features.select(engineered_features).to_pandas().dropna()
tr = SmartCorrelatedSelection(
variables=None,
method="pearson",
threshold=0.95,
missing_values="raise",
selection_method="variance",
estimator=None,
)
tr.fit(features_pd)
print('Features to drop:')
for f in tr.features_to_drop_:
print(f)
Features to drop:
std_global_dest_degrees
n_unique_nodes
max_local_source_degrees
max_local_dest_degrees
std_local_dest_degrees
avg_local_dest_degrees
avg_local_source_degrees
Observations:
- The engineered features form several groups of high correlation
Impact
- n_unique_nodes, std_global_dest_degrees, avg_local_source_degrees, max_local_source_degrees, avg_local_dest_degrees, max_local_dest_degrees, and std_local_dest_degrees are dropped from the feature list because each belongs to a highly correlated group and has lower variance than the retained feature
EDA for Remaining Engineered Features¶
remaining_engineered_features = list(set(features_pd).difference(set(tr.features_to_drop_)))
graph_features = graph_features.join(data.select(['_id', 'is_anomaly']), on='_id')
scores = []
for f in remaining_engineered_features:
print("Feature Analysis:", f)
score = feature_predictive_power(graph_features, f, "is_anomaly")
scores.append(score)
Feature Analysis: max_global_dest_degrees Predictive Power Score: 0.5921000242233276
Feature Analysis: n_connections Predictive Power Score: 0.5871999859809875
Feature Analysis: avg_global_source_degrees Predictive Power Score: 0.328900009393692
Feature Analysis: min_local_dest_degrees Predictive Power Score: 0.007799999788403511
Feature Analysis: min_global_source_degrees Predictive Power Score: 0.5494999885559082
Feature Analysis: max_global_source_degrees Predictive Power Score: 0.36739999055862427
Feature Analysis: avg_global_dest_degrees Predictive Power Score: 0.3370000123977661
Feature Analysis: min_local_source_degrees Predictive Power Score: 0.0
Feature Analysis: std_local_source_degrees Predictive Power Score: 0.5327000021934509
Feature Analysis: min_global_dest_degrees Predictive Power Score: 0.5932000279426575
Feature Analysis: std_global_source_degrees Predictive Power Score: 0.44369998574256897
pd.Series(scores, index=remaining_engineered_features).sort_values(ascending=False)
min_global_dest_degrees      0.5932
max_global_dest_degrees      0.5921
n_connections                0.5872
min_global_source_degrees    0.5495
std_local_source_degrees     0.5327
std_global_source_degrees    0.4437
max_global_source_degrees    0.3674
avg_global_dest_degrees      0.3370
avg_global_source_degrees    0.3289
min_local_dest_degrees       0.0078
min_local_source_degrees     0.0000
dtype: float32
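feature_predictive_power also lives in utils and is not shown; as a rough sketch, a predictive power score of this kind typically fits a shallow single-feature model and normalises its score against a naive baseline (in the spirit of the ppscore package; the function name and details below are illustrative):

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def feature_predictive_power_sketch(df: pd.DataFrame, feature: str, target: str) -> float:
    """Score how well a single feature predicts a classification target.

    A shallow decision tree is evaluated out-of-fold and compared against a
    majority-class baseline; 0 means no signal, 1 means perfect prediction.
    """
    X = df[[feature]].to_numpy()
    y = df[target].to_numpy()
    preds = cross_val_predict(
        DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=4
    )
    model_f1 = f1_score(y, preds, average="weighted")
    # Baseline: always predict the most frequent class
    majority = pd.Series(y).mode()[0]
    baseline = f1_score(y, [majority] * len(y), average="weighted")
    if baseline >= 1.0:
        return 0.0
    return max(0.0, (model_f1 - baseline) / (1.0 - baseline))
```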
Observations:
- Most of the engineered features have relatively high predictive power scores
- The most predictive features are global
- Features with no predictive power measure minimum degrees of local graphs
- Relationships between the engineered features and the target are non-linear
Impact
- min_local_dest_degrees and min_local_source_degrees can be dropped
- Tree-based models need to be used to capture the engineered non-linear relationships
remaining_engineered_features = [f for f in remaining_engineered_features if f not in ['min_local_dest_degrees', 'min_local_source_degrees']]
print('Final engineered featureset:')
print(remaining_engineered_features)
Final engineered featureset: ['max_global_dest_degrees', 'n_connections', 'avg_global_source_degrees', 'min_global_source_degrees', 'max_global_source_degrees', 'avg_global_dest_degrees', 'std_local_source_degrees', 'min_global_dest_degrees', 'std_global_source_degrees']
Feature Engineering Pipeline¶
selected_features = [
"max_global_source_degrees",
"avg_global_source_degrees",
"min_global_dest_degrees",
"std_local_source_degrees",
"max_global_dest_degrees",
"min_global_source_degrees",
"std_global_source_degrees",
"n_connections",
"avg_global_dest_degrees",
]
calls = (
(
pl.read_json("../data/supervised_call_graphs.json")
.with_columns(
pl.col("call_graph").list.eval(
pl.element().struct.rename_fields(["from", "to"])
)
)
.explode("call_graph")
.unnest("call_graph")
)
.with_columns(
global_source_degrees=pl.len().over(pl.col("from")),
global_dest_degrees=pl.len().over(pl.col("to")),
local_source_degrees=pl.len().over(pl.col("from"), pl.col("_id")),
local_dest_degrees=pl.len().over(pl.col("to"), pl.col("_id")),
)
.pipe(get_graph_features)
.select(["_id"] + selected_features)
)
pl.read_parquet("../data/supervised_clean_data.parquet").join(
calls, on="_id"
).write_parquet("../data/supervised_clean_data_w_features.parquet")
final_data = pl.read_parquet("../data/supervised_clean_data_w_features.parquet")
final_data.head()
| _id | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly | max_global_source_degrees | avg_global_source_degrees | min_global_dest_degrees | std_local_source_degrees | max_global_dest_degrees | min_global_source_degrees | std_global_source_degrees | n_connections | avg_global_dest_degrees | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool | u32 | f64 | u32 | f64 | u32 | u32 | f64 | u32 | f64 |
| 0 | "1f2c32d8-2d6e-3b68-bc46-789469… | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | false | 32071 | 4055.665012 | 2 | 30.889073 | 22416 | 3 | 6840.719715 | 2821 | 4547.629918 |
| 1 | "4c486414-d4f5-33f6-b485-24a8ed… | 0.000063 | 0.002211 | 16.166805 | 519 | "default" | 9299.0 | 8447.0 | 302.0 | "E" | "normal" | false | 32071 | 5174.526772 | 12 | 19.458242 | 22416 | 3 | 7527.970754 | 1270 | 5858.023622 |
| 2 | "7e5838fc-bce1-371f-a3ac-d8a0b2… | 0.004481 | 0.015324 | 99.573276 | 6211 | "default" | 255.0 | 232.0 | 354.0 | "E" | "normal" | false | 32071 | 4174.369415 | 2 | 17.700993 | 22416 | 3 | 7048.385421 | 1589 | 4814.517306 |
| 3 | "82661ecd-d87f-3dff-855e-378f7c… | 0.017837 | 0.014974 | 69.792793 | 8292 | "default" | 195.0 | 111.0 | 116.0 | "E" | "normal" | false | 32071 | 5867.786492 | 10 | 7.321257 | 22416 | 12 | 7153.580321 | 459 | 6689.276688 |
| 4 | "d62d56ea-775e-328c-8b08-db7ad7… | 0.000797 | 0.006056 | 14.952756 | 182 | "default" | 272.0 | 254.0 | 23.0 | "E" | "normal" | false | 32071 | 6914.842697 | 38 | 3.163085 | 22416 | 53 | 10320.004581 | 89 | 5613.41573 |